Sampling to estimate arbitrary subset sums
Authors
Abstract
Starting with a set of weighted items, we want to create a generic sample of a certain size that we can later use to estimate the total weight of arbitrary subsets. Applied to Internet traffic analysis, the items could be records summarizing the flows of packets streaming by a router, with, say, a hundred records to be sampled each hour. A subset could be flow records of a worm attack whose signature is only determined after sampling has taken place. The samples taken in the past allow us to trace the history of the attack even though the worm was unknown at the time of sampling. Estimation from the samples must be accurate even with heavy-tailed distributions where most of the weight is concentrated on a few heavy items. We want the sample to be weight-sensitive, giving priority to heavy items. At the same time, we want sampling without replacement in order to avoid selecting heavy items multiple times. To fulfill these requirements we introduce priority sampling, which is the first weight-sensitive sampling scheme without replacement that is suitable for estimating subset sums. Testing priority sampling on Internet traffic analysis, we found it to perform orders of magnitude better than previous schemes. Priority sampling is simple to define and implement: we consider a stream of items i = 0, ..., n − 1 with weights w_i. For each item i, we generate a random number α_i ∈ (0, 1) and create a priority q_i = w_i/α_i. The sample S consists of the k highest-priority items. Let τ be the (k+1)th highest priority. Each sampled item i in S gets a weight estimate ŵ_i = max{w_i, τ}, while non-sampled items get weight estimate ŵ_i = 0. Magically, it turns out that the weight estimates are unbiased, that is, E[ŵ_i] = w_i, and by linearity of expectation, we get unbiased estimators over any subset sum simply by adding the sampled weight estimates from the subset.
Also, we can estimate the variance of the estimates, and find, surprisingly, that the covariance between estimates ŵi and ŵj of different weights is zero. Finally, we conjecture an extremely strong near-optimality; namely that for any weight sequence, there exists no specialized scheme for sampling k items with unbiased weight estimators that gets smaller total variance than priority sampling with k+1 items. Very recently, Szegedy has settled this conjecture.
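The scheme described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `priority_sample` and `estimate_subset_sum` are hypothetical, the weights are assumed to be positive, α_i is drawn from (0, 1] rather than the open interval (0, 1) to avoid division by zero, and τ is taken to be 0 when there are at most k items (so every item is sampled with its exact weight).

```python
import heapq
import random


def priority_sample(weights, k, rng=None):
    """Sketch of priority sampling over a list of positive weights.

    Returns (estimates, tau), where estimates maps the index of each
    sampled item to its weight estimate max(w_i, tau). Items absent
    from the mapping have weight estimate 0.
    """
    rng = rng or random.Random()
    priorities = []
    for i, w in enumerate(weights):
        # random() lies in [0, 1), so 1 - random() lies in (0, 1].
        alpha = 1.0 - rng.random()
        priorities.append((w / alpha, i))  # priority q_i = w_i / alpha_i

    # The k + 1 highest priorities, in descending order.
    top = heapq.nlargest(k + 1, priorities)
    if len(top) <= k:
        # Assumption: with at most k items, all are kept and tau = 0.
        tau, sampled = 0.0, top
    else:
        tau, sampled = top[k][0], top[:k]

    estimates = {i: max(weights[i], tau) for _, i in sampled}
    return estimates, tau


def estimate_subset_sum(estimates, subset):
    """Add up the weight estimates of the sampled items that lie in subset."""
    return sum(est for i, est in estimates.items() if i in subset)
```

The subset can be chosen after sampling, which matches the worm-attack scenario above: `estimate_subset_sum(estimates, {0, 3})` estimates w_0 + w_3 from whichever of those items ended up in the sample.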
Similar resources
Stream sampling for variance-optimal estimation of subset sums
From a high volume stream of weighted items, we want to maintain a generic sample of a certain limited size k that we can later use to estimate the total weight of arbitrary subsets. This is the classic context of on-line reservoir sampling, thinking of the generic sample as a reservoir. We present an efficient reservoir sampling scheme, VarOptk, that dominates all previous schemes in terms of ...
Efficient Stream Sampling for Variance-Optimal Estimation of Subset Sums
From a high volume stream of weighted items, we want to maintain a generic sample of a certain limited size k that we can later use to estimate the total weight of arbitrary subsets. This is the classic context of on-line reservoir sampling, thinking of the generic sample as a reservoir. We present an efficient reservoir sampling scheme, VAROPTk, that dominates all previous schemes in terms of ...
On the Variance of Subset Sum Estimation
For high volume data streams and large data warehouses, sampling is used for efficient approximate answers to aggregate queries over selected subsets. Mathematically, we are dealing with a set of weighted items and want to support queries to arbitrary subset sums. With unit weights, we can compute subset sizes which together with the previous sums provide the subset averages. The question addre...
Total Variation Asymptotics for Sums of Independent Integer Random Variables
Let $W_n := \sum_{j=1}^n Z_j$ be a sum of independent integer-valued random variables. In this paper, we derive an asymptotic expansion for the probability $\mathbb{P}[W_n \in A]$ of an arbitrary subset $A \subseteq \mathbb{Z}$. Our approximation improves upon the classical expansions by including an explicit, uniform error estimate, involving only easily computable properties of the distributions of...
Extension Theorems for the Fourier Transform Associated with Non-degenerate Quadratic Surfaces in Vector Spaces over Finite Fields
We study the restriction of the Fourier transform to quadratic surfaces in vector spaces over finite fields. In two dimensions, we obtain the sharp result by considering the sums of arbitrary two elements in the subset of quadratic surfaces on two dimensional vector spaces over finite fields. For higher dimensions, we estimate the decay of the Fourier transform of the characteristic functions o...
Journal:
- CoRR
Volume abs/cs/0509026, Issue -
Pages -
Publication date 2005